Abstract

Using an ensemble of classifiers instead of a single classifier has been
shown to improve generalization performance in many pattern recognition
problems. However, the extent of such improvement depends greatly on
the amount of correlation among the errors of the base classifiers.
Therefore, reducing those correlations while keeping the classifiers'
performance levels high is an important area of research. In this article,
we explore input decimation (ID), a method which selects feature subsets for
their ability to discriminate among the classes and uses them to decouple the
base classifiers. We provide a summary of the theoretical benefits of
correlation reduction, along with results of our method on two underwater
sonar data sets, three benchmarks from the Proben1/UCI repositories,
and two synthetic data sets. The results indicate that input decimated
ensembles (IDEs) outperform ensembles whose base classifiers use all the
input features; randomly selected subsets of features; and features created
using principal components analysis, on a wide range of domains.


1 Introduction 


Using an ensemble of classifiers instead of a single classifier has been
repeatedly shown to improve generalization performance in many pattern
recognition problems [9, 17, 69]. It is well known that, in order to obtain
such improvement, one needs to simultaneously maintain a reasonable level of
performance in the base classifiers that constitute the ensemble and reduce
their correlations [1, 29, 44, 49, 64]. There are many ensemble methods that
actively promote diversity [...]. Some of these bring about diversity in the
base models by training them with different subsets of the training set. One
drawback of such methods is that, by definition, only a portion of the
available data is used during learning. This can lead to poor performance,
particularly when the data sets are small to begin with. Training the base
classifiers using different subsets of features avoids this issue, as all the
patterns can be used in the training while still yielding base classifiers
that differ from one another.

Two such methods are Principal Component Analysis (PCA) and random feature
subset selection [10, 25]. PCA constructs new features such that the data
shows maximum variability over those features. However, PCA not only
generates the same features for every base classifier, but also fails to take
class information into account. Random feature selection avoids the first
problem of PCA, but it too does not take class information into account.

In this paper, we present input decimation, a method of choosing different
subsets of the original features based on the correlations between individual
features and class labels, and training classifiers on those subsets prior to
combining. This method not only reduces the dimensionality of the data, but
uses that dimensionality reduction to reduce the correlations among the
classifiers in an ensemble, thereby improving the classification performance
of the ensemble [67].

Our results indicate that input decimation reduces the error up to 90% over
ensembles trained on all the features, randomly selected subsets of features,
and principal components. While we expected strong ensemble performance,
input decimation also provided improvements in the base classifiers in many
cases by eliminating irrelevant features, thus simplifying the learning
problem faced by each base classifier. In this study we focus on the
averaging combiner for two reasons: (i) despite its simplicity (or perhaps
because of it) this combiner has been shown to perform well and hold its own
against a wide array of more sophisticated methods [16, 17]; and (ii) by
choosing a simple combiner we isolate the effects of input decimation from
those of the combining method. Furthermore, pattern-level ensemble methods
such as bagging, boosting, and stacking can be used in conjunction with input
decimation [...]. Therefore, one can make meaningful comparisons between
bagging combiners with and without input decimation, or, say, between
stacking combiners with and without input decimation (not reported in this
article), but not between input decimated ensembles and bagging or boosting.

In Section 2 we summarize a theory of classifier ensembles that highlights
the connection between correlation among base classifiers and ensemble
performance, and provide a brief overview of dimensionality reduction
methods. In Section 3 we present the details of input decimated ensembles,
and in Section 4 we provide experimental results on two real underwater sonar
data sets, three data sets from the PROBEN1/UCI benchmarks [6, 54], and two
synthetic data sets which allow a systematic study of input decimation. We
conclude with a discussion on the effectiveness of input decimation under
various circumstances, along with future research directions, in Section 5.





2 Background 


Model selection is a ubiquitous problem in many pattern recognition problems.
Neither the selection of the method (e.g., multi-layer perceptron, nearest
neighbor algorithm), nor the tuning of that algorithm can yet be fully
automated for all problems [15, 20, 23]. The use of ensembles provides
partial relief since, by pooling the classifiers before a decision is made,
the potential sensitivity of any single model is greatly reduced. Of course,
the more similar the classifiers are, the less likely it is that new
information will be present in the ensemble, resulting in little more than a
“rubber stamping” committee. In this section, we first formalize this
connection between the correlation among the classifiers and ensemble
performance and then discuss various methods that aim to reduce that
correlation.

2.1 Correlation and Ensemble Performance 

In this article we focus on classifiers that model the a posteriori
probabilities of the output classes. Such algorithms include properly trained
feed-forward neural networks such as Multi-Layer Perceptrons (MLPs). We can
model the ith output of such a classifier as follows (details of this
derivation are in [63, 64]):

\[ f_i(x) = P(C_i|x) + \eta_i(x), \]

where $P(C_i|x)$ is the posterior probability of the ith class given pattern
x and $\eta_i(x)$ is the error associated with the ith output. Given an input
x, if we have one classifier, we classify x as being in the class i whose
value $f_i(x)$ is largest. Instead, if we use an ensemble that averages the
outputs of N classifiers $f_i^m(x)$, $m \in \{1, \ldots, N\}$, then the ith
output of the ensemble is given by:

\[ f_i^{ave}(x) = \frac{1}{N} \sum_{m=1}^{N} f_i^m(x) = P(C_i|x) + \bar{\eta}_i(x), \tag{1} \]

where:

\[ \bar{\eta}_i(x) = \frac{1}{N} \sum_{m=1}^{N} \eta_i^m(x) \]

and $\eta_i^m(x)$ is the error associated with the ith output of the mth
classifier.
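Now, the variance of $\bar{\eta}_i(x)$ is given by [64]:

\[ \sigma^2_{\bar{\eta}_i} = \frac{1}{N^2} \sum_{m=1}^{N} \sigma^2_{\eta_i^m(x)} + \frac{1}{N^2} \sum_{m=1}^{N} \sum_{l \neq m} \mathrm{cov}(\eta_i^m(x), \eta_i^l(x)). \]

If we express the covariances in terms of the correlations
($\mathrm{cov}(x,y) = \mathrm{corr}(x,y)\,\sigma_x \sigma_y$), assume the same
variance $\sigma^2_{\eta_i(x)}$ across classifiers, and use the average
correlation factor among classifiers, $\delta_i$, given by

\[ \delta_i = \frac{1}{N(N-1)} \sum_{m=1}^{N} \sum_{l \neq m} \mathrm{corr}(\eta_i^m(x), \eta_i^l(x)), \tag{2} \]

then the variance becomes:

\[ \sigma^2_{\bar{\eta}_i} = \frac{1}{N} \sigma^2_{\eta_i(x)} + \frac{N-1}{N}\, \delta_i\, \sigma^2_{\eta_i(x)} = \frac{1 + \delta_i (N-1)}{N}\, \sigma^2_{\eta_i(x)}. \tag{3} \]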
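As a rough numerical check of Equation 3, the sketch below simulates N classifier errors with unit variance and a common pairwise correlation δ, then compares the empirical variance of their average with (1 + δ(N − 1))/N. The equicorrelated Gaussian construction, the function names, and the sample size are illustrative assumptions and are not part of the original derivation.

```python
import math
import random
import statistics


def predicted_variance(delta, n_classifiers, sigma2=1.0):
    """Variance of the averaged error according to Equation 3."""
    return (1.0 + delta * (n_classifiers - 1)) / n_classifiers * sigma2


def simulated_variance(delta, n_classifiers, n_samples=20000, seed=0):
    """Empirical variance of the average of equicorrelated unit-variance errors.

    Assumption: each error is sqrt(delta) * shared + sqrt(1 - delta) * private
    Gaussian noise, which yields pairwise correlation delta (0 <= delta <= 1).
    """
    rng = random.Random(seed)
    averages = []
    for _ in range(n_samples):
        shared = rng.gauss(0.0, 1.0)
        errors = [
            math.sqrt(delta) * shared + math.sqrt(1.0 - delta) * rng.gauss(0.0, 1.0)
            for _ in range(n_classifiers)
        ]
        averages.append(sum(errors) / n_classifiers)
    return statistics.pvariance(averages)
```

With independent errors (δ = 0) the predicted variance falls as 1/N, whereas with identical errors (δ = 1) averaging leaves the variance unchanged.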


Based on this variance, we can compute the variance of the decision boundary
and, by relating it to the classifier error, we obtain the relationship
between the model error (beyond the Bayes error) of the ensemble
($E^{ave}_{model}$) and that of an individual classifier ($E_{model}$)
[63, 64]:

\[ E^{ave}_{model} = \left( \frac{1 + \delta (N-1)}{N} \right) E_{model}, \tag{4} \]

where

\[ \delta = \sum_{i=1}^{L} P_i\, \delta_i, \tag{5} \]

L is the number of classes, and $P_i$ is the prior probability of class i.

Equation 4 quantifies the connection between error reduction and the cor- 
relation among the errors of the base classifiers. This result leads us to seek to 
reduce the correlation among classifiers prior to using them in an ensemble. 

2.2 Correlation Reduction Methods 

As shown above, if the classifiers to be combined repeatedly provide the same
(either erroneous or correct) classification decisions, there is little to be
gained from combining, regardless of the chosen scheme. As Equation 4 shows,
reducing δ and increasing N are two ways to improve the performance of a
classifier ensemble. However, these two ways are not independent. This
phenomenon is best illustrated by Figure 1, where the error reduction
depending on the correlation among the classifiers is displayed as a function
of the number of classifiers (based on Equation 4). For example, even though
increasing the number of classifiers from 4 to 8 does not provide any
sizeable gains when the correlation is .9, it provides significant gains if
the correlation is .1. That is, keeping the correlations low not only
provides better error reduction for a given number of classifiers, but
provides greater gains when adding classifiers.
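The interaction between the correlation and the number of classifiers can be checked directly from Equation 4. The following minimal sketch evaluates the ratio of ensemble model error to single-classifier model error; the function names are illustrative and do not come from the original article.

```python
def error_ratio(delta, n_classifiers):
    """Ratio E_model^ave / E_model from Equation 4."""
    return (1.0 + delta * (n_classifiers - 1)) / n_classifiers


def overall_delta(priors, class_deltas):
    """Equation 5: prior-weighted average of the per-class correlations."""
    return sum(p * d for p, d in zip(priors, class_deltas))


for delta in (0.9, 0.1):
    r4, r8 = error_ratio(delta, 4), error_ratio(delta, 8)
    print(f"delta={delta}: N=4 ratio={r4:.4f}, N=8 ratio={r8:.4f}")
```

For a correlation of .9 the ratio moves only from 0.925 to about 0.913 when going from 4 to 8 classifiers, while for a correlation of .1 it drops from 0.325 to about 0.213.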

Figure 1: Effect of correlation on error reduction.

To improve ensemble performance one must either actively promote diversity
during training or achieve diversity through the selection of the data
presented to the base classifier training algorithms. Examples of the former
include distorting the output space through error-correcting output codes
[18], using principal component analysis on the output space [45], using
genetic algorithms to train the classifier [50, 61], or modifying the error
function used for training. Examples of the latter include bagging [9],
cross-validation partitioning, and even boosting [22] (though the goal there
is not to reduce correlation, the net effect is the same). The most common
data selection methods focus on the “pattern” space, though dimensionality
reduction methods which manipulate the “feature” space can also be used.
Feature space methods have the advantage that they do not reduce the number
of patterns available for training each classifier. They generally fall into
one of two different classes of methods: feature selection or feature
extraction.

Feature extraction algorithms such as Principal Components Analysis (PCA)
[5, 31, 52] or Independent Component Analysis (ICA) [28] reduce the
dimensionality of the data by creating new features. Linear PCA, the most
commonly used feature extraction method, creates new features that are linear
combinations of the original features. The aim of PCA, however, is to derive
features on which the data shows the highest variability, whether those
features are useful for classification or not [5]. Furthermore, because all
the information present in the initial features is “crammed” into fewer
features, there is a danger that classifiers trained on the principal
components will have higher, not lower, correlations among them.

Figure 2: PCA and classification: The first principal component (y_1) can
provide a good discriminating feature (left) or a poor one (right), since the
class membership information is not used.

Figure 2 demonstrates the perils of not using class information. The left
half of the figure shows a case in which PCA works effectively. In this case
the first principal component (y_1) corresponds to the variable with the
highest discriminating power. The right half shows a similar data set
(similar data distribution and linearly separable). However, because the
first principal component is not “aligned” with the class labels, selecting
this component is a poor choice for this problem. Indeed, an input set
consisting of only the first component would provide practically random
decisions on this data set. Yet, PCA remains one of the most frequently used
dimensionality reduction methods in many classification domains, including
medical and space applications [55, 60].
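The pitfall illustrated by Figure 2 can be reproduced with a small two-dimensional sketch. The data generator, its parameters, and the threshold classifier below are assumptions chosen for illustration and are not the data behind the original figure. In one configuration the class-separating feature also has the largest variance; in the other, the high-variance feature carries no class information.

```python
import math
import random


def first_principal_direction(points):
    """Leading eigenvector of the 2x2 sample covariance, plus the sample mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)  # angle of the major axis
    return (math.cos(theta), math.sin(theta)), (mx, my)


def make_data(n_per_class, discriminative_is_wide, seed=0):
    """Two classes (-1, +1) in 2-D; exactly one feature separates them."""
    rng = random.Random(seed)
    data, labels = [], []
    for label in (-1, 1):
        for _ in range(n_per_class):
            if discriminative_is_wide:
                # Separating feature also has the largest variance.
                point = (label * 10.0 + rng.gauss(0, 1), rng.gauss(0, 1))
            else:
                # High-variance feature is pure noise; the narrow one separates.
                point = (rng.uniform(-10, 10), label * 1.0 + rng.gauss(0, 0.2))
            data.append(point)
            labels.append(label)
    return data, labels


def accuracy_on_first_pc(data, labels):
    """Accuracy of thresholding the projection onto the first component."""
    direction, mean = first_principal_direction(data)
    correct = 0
    for (x, y), label in zip(data, labels):
        score = (x - mean[0]) * direction[0] + (y - mean[1]) * direction[1]
        correct += (score > 0) == (label > 0)
    acc = correct / len(data)
    return max(acc, 1.0 - acc)  # the sign of a principal direction is arbitrary
```

Calling `accuracy_on_first_pc(*make_data(1000, True))` gives a nearly perfect classifier, whereas `accuracy_on_first_pc(*make_data(1000, False))` stays close to chance even though the classes are linearly separable in the original features.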

Feature selection algorithms focus on selecting a subset of the features to
present to the classifiers. One example is the random subspace method
[10, 27], where random subsets of the original features are presented to the
classifiers. However, looking at y_1 and y_2 (assuming those two are the
original features) in Figure 2 shows a pitfall of random feature selection.
Randomly selecting feature y_1 in the class configuration shown in (a) will
lead to satisfactory classification, whereas randomly selecting feature y_1
in (b) will lead to all discriminating information being lost. Many other
feature selection methods use various criteria for deciding the relevance of
each feature to the task at hand and choose some subset of the features
according to those criteria [3, 7, 8, 19, 30, 43]. The subset selection can
be distinct from the learning, which is the case with filter methods.
However, most of these feature selection methods attempt to choose features
that are useful in discriminating across all classes. Using such a method
within an ensemble learning scheme would have limited effectiveness since it
would choose the same features for every base classifier, leading to
relatively small correlation reduction. One exception is to break an L-class
problem into $\binom{L}{2}$ two-class problems and perform feature selection
within each of those problems [41]. In many real-world problems, there are
features that are useful at distinguishing whether a pattern is of one
particular class but are not useful at distinguishing among the remaining
classes. In the next section we present input decimation, which takes
advantage of this fact to reduce both dimensionality and correlation in
classifier ensembles.

There are variations on PCA that use local and/or nonlinear processing to
improve dimensionality reduction [13, 33, 34, 47, 48, 59]. Although these
methods implicitly account for some class information and therefore are
better suited than global PCA methods for classification problems, they do
not directly use class information.

3 Input Decimated Ensembles 

Input Decimation decouples the classifiers by exposing them to different
aspects of the same data. ID trains L classifiers, one corresponding to each
class in an L-class problem.² For each classifier, the method selects a
user-determined number of the input features having the highest absolute
correlation to the presence or absence of the corresponding class. The
objective is to weed out input features that do not carry strong
discriminating information for a particular class, and thereby reduce the
dimensionality of the feature space to facilitate the learning process.
Additionally, the classifiers’ features are selected using different
criteria, which leads to different feature subsets for each base classifier
and a reduction in their correlations.

Let the training set take the following form:

\[ \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}, \]

where m is the number of training examples. Each $x_i$ has $\|F\|$ elements
(where F is the set of input features) representing the values of the input
features of example i. Each $y_i$ represents the class using a distributed
encoding, i.e., it has L elements, where L is the number of classes;
$y_{il} = 1$ if example i belongs to class l and $y_{il} = 0$ otherwise. In
this study our base classifiers consist of MLPs trained with the
backpropagation algorithm.

Given such a data set and a base classifier learning algorithm, input
decimated ensembles operate as follows:

• For each class $l \in \{1, 2, \ldots, L\}$:

  1. Compute the absolute value of the correlation between each feature
     j ($x_{ij}$ for all patterns i) and the output for class l ($y_{il}$ for
     all patterns i).

  2. Select the $n_l$ features having the highest absolute correlation,
     resulting in new feature set $F_l$. One can either predetermine $n_l$
     based on prior information about the data set, or learn the value to
     optimize performance.

  3. Construct a new training set by retaining only those elements of the
     $x_i$'s corresponding to the features $F_l$ and all the outputs.

  4. Run the base classifier learning algorithm on this new training set.
     Call the resulting classifier $f^l$.⁵

Given a new example x, we classify it as follows:

• For each class $k \in \{1, 2, \ldots, L\}$, calculate
  $f_k^{ave}(x) = \frac{1}{L} \sum_{l=1}^{L} f_k^l(x)$ by presenting the
  proper features $F_l$ of example x to each of the L classifiers.⁶

• Return the class $K = \arg\max_k f_k^{ave}(x)$.

Fundamentally, input decimation seeks to reduce the correlations among
individual classifiers by using different subsets of input features, while
pattern-level methods such as bagging and boosting attempt to do so by
choosing different subsets of training patterns.

² More generally, one trains nL classifiers, where n is a positive integer
[...].
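The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the study uses MLPs as base classifiers, whereas `fit_centroids` below is a hypothetical stand-in learner (class centroids with softmax scores) chosen to keep the example dependency-free, and all function names are invented for this sketch.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def select_features(X, Y, l, n_l):
    """Steps 1-2: indices of the n_l features most correlated with output l."""
    target = [y[l] for y in Y]
    scores = [abs(pearson([x[j] for x in X], target)) for j in range(len(X[0]))]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_l])


def fit_centroids(X, Y):
    """Stand-in base learner (assumption; the study uses MLPs)."""
    n_classes = len(Y[0])
    centroids = []
    for k in range(n_classes):
        rows = [x for x, y in zip(X, Y) if y[k] == 1]
        centroids.append([sum(col) / len(rows) for col in zip(*rows)])

    def predict(x):
        d = [sum((a - c) ** 2 for a, c in zip(x, cen)) for cen in centroids]
        w = [math.exp(min(d) - di) for di in d]  # numerically stable softmax
        return [wi / sum(w) for wi in w]

    return predict


def train_ide(X, Y, n_l, fit=fit_centroids):
    """Steps 3-4: one classifier per class, each on its own feature subset F_l."""
    ensemble = []
    for l in range(len(Y[0])):
        F_l = select_features(X, Y, l, n_l)
        reduced = [[x[j] for j in F_l] for x in X]
        ensemble.append((F_l, fit(reduced, Y)))
    return ensemble


def classify(ensemble, x):
    """Average the L classifiers' outputs and return the arg-max class index."""
    L = len(ensemble)
    outputs = [clf([x[j] for j in F_l]) for F_l, clf in ensemble]
    f_ave = [sum(out[k] for out in outputs) / L for k in range(L)]
    return max(range(L), key=lambda k: f_ave[k])
```

Here `X` is a list of feature vectors and `Y` a list of one-hot class vectors; `train_ide(X, Y, n_l)` uses the same `n_l` for every class, which is a simplification of the per-class $n_l$ in the description above.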


4 Experimental Results 

In this section, we present the results of input decimation on two underwater
sonar data sets, three Proben1/UCI benchmark data sets and two synthetic
data sets. In all results reported below, the base classifiers consist of
Multi-Layer Perceptrons (MLPs) with a single hidden layer trained with the
backpropagation algorithm. The learning rate, momentum term, and number of
hidden units were experimentally determined. In all cases, we report test set
error rates averaged over 20 runs, along with the differences in the means.⁷

4.1 Passive Sonar Signals

A real-world problem with all the characteristics required for a complete
study is that of classifying short duration underwater signals obtained from
passive sonar signals [14]. Both biological and non-biological phenomena
produce such short duration sounds, and experts can determine the cause by
studying their pulse signatures or spectrograms. Automating this
classification process is difficult because these signals are highly
non-stationary, have different spectral characteristics depending on sources
or propagation paths, and may have significant overlap. A more detailed
description of the sonar signals and the difficulty associated with their
classification can be found in [...].

⁵ If one is training nL classifiers for n > 1, then the algorithm calls the
base classifier learning algorithm n times to create n classifiers
$f^{l,1}, \ldots, f^{l,n}$ with feature set $F_l$.

⁶ If we are instead training nL classifiers for n > 1, then we calculate
$f_k^{ave}(x) = \frac{1}{nL} \sum_{l=1}^{L} \sum_{p=1}^{n} f_k^{l,p}(x)$.

⁷ That is, for an error with mean μ and variance σ², we report μ ± σ/√K,
where K is the number of repetitions (K = 20 for experiments reported here).
Confidence intervals of desired sensitivity can be obtained directly from the
differences in the means.
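The μ ± σ/√K reporting convention used for the error rates can be written as a short helper. This is a minimal sketch: the function name is illustrative, and the use of the sample (rather than population) standard deviation is an assumption, since the text does not specify which estimator was used.

```python
import math
import statistics


def summarize_runs(error_rates):
    """Return (mean, sigma / sqrt(K)) for the error rates of K repeated runs."""
    k = len(error_rates)
    mean = statistics.mean(error_rates)
    sigma = statistics.stdev(error_rates)  # sample standard deviation (assumption)
    return mean, sigma / math.sqrt(k)
```

For example, `summarize_runs([7.0, 8.0, 9.0])` yields a mean of 8.0 with a spread term of 1/√3; the input values here are made up purely to show the call.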

The two data sets used for this experiment are both extracted from short
duration passive sonar signals due to four naturally occurring oceanic
sources (sound of ice cracking, porpoise and two different whale sounds).
Although there is some complementarity among the data sets, for the purposes
of this study we treat them as different data sets. The first set, SONAR1,
consists of 25 features including 16 Gabor wavelet coefficients, signal
duration and other temporal descriptors and spectral measurements. There were
496 training and 823 test patterns. The second set, SONAR2, consists of 24
features, including reflection coefficients corresponding to the maximum
broadband energy segment using both short and long time windows, signal
duration and other temporal descriptors. There were 564 training and 823 test
patterns. For both data sets, we used an MLP with 50 hidden units.

Table 1 shows the error rates, differences in the mean, and correlations
among the base classifiers for both the full feature set and the input
decimated set. In this case, each base classifier had an input decimated set
of 22 features for both SONAR1 and SONAR2 after features with little
correlation to each output were deleted. Retaining more features did not
result in a significant reduction in correlations, whereas removing more
features resulted in drops in individual classifier performance that were too
large to be compensated by combining. In fact, this data set is not
particularly well-suited for input decimation because it has a small number
of carefully-extracted, relevant features.


Table 1: Ensemble performance on both sonar data sets.

Data set   N    Full Feature Set          Input Decimation
                Error Rate      δ         Error Rate      δ
SONAR1     1    7.47 ± .10                8.38 ± .15
           4    7.05 ± .07      .89       7.10 ± .07      .68
           8    7.17 ± .05                6.99 ± .06
SONAR2     1    9.95 ± .16                9.73 ± .16
           4    9.26 ± .15      .76       8.80 ± .06      .72
           8    8.94 ± .11                8.62 ± .06

For SONAR1, the deletion of even lowly-correlated inputs affects the
performance of the base classifier significantly. However, due to the
correspondingly large reduction in the error correlation, input decimated
ensembles perform at the level of the full feature set, with IDE for N = 8
providing a statistically significant gain over the full feature set ensemble
and IDE for N = 4 at the .05 level. For SONAR2, the gains are more
significant in that even the input decimated base classifier improves
slightly upon the full feature set classifier, allowing for sizable gains by
the input decimated ensemble. This is achieved in spite of the relatively
modest drop in the error correlation among the base classifiers. Also, note
that for SONAR1, because the correlation is high for the base classifiers
trained on the full feature set, increasing the number of base classifiers
does not provide any gains and instead provides statistically equivalent
error rates.

[...] not assume signal stationarity [12].


4.2 Probenl/UCI Benchmarks 

In the SONAR data presented above, each feature carried a significant amount
of discriminating information. In fact, because each feature was carefully
extracted from the raw data, one should not have expected much improvement
through input decimation. In this section we perform a more detailed analysis
on three benchmark data sets where we gradually decrease the dimensionality
until we end up with 5-10% of the original features. On these benchmark sets
we expect this more extreme case of input decimation to expose the strengths
and weaknesses of this method.

The three data sets from the UCI/PROBEN1 benchmarks [6, 54] selected for
this study were: the Gene data set from PROBEN1 (i.e., using the train/test
split from PROBEN1), and the Splice junction gene sequences and Satellite
Image data sets (Statlog version) from the UCI Machine Learning Repository.
The Gene data set has 120 input features and three classes [46, 54]. The MLP
has a single hidden layer of 20 units, a learning rate of 0.2 and a momentum
term of [...]. The Splice data consists of 60 input features and three
classes [6]. Here we selected an MLP with a single hidden layer composed of
120 units, a learning rate of 0.05, and a momentum term of 0.1. The Satellite
Image data set has 36 input features and 6 classes [6]. We selected an MLP
with a single hidden layer of 50 units, and a learning rate and momentum term
of 0.5. The ensembles consisted of three classifiers for Gene and Splice and
six classifiers for Satellite Image, the same as the number of classes.

Figures 3-5 show the classification performance and classifier correlations
for all three data sets, averaged over 20 runs. For clarity we omit the error
bars since they ranged from 0.05 to 0.25% and as such were smaller than the
symbols representing the data points. The rightmost point in each graph
(e.g., the point corresponding to 120 features for the Gene data set) shows
the full feature set performance. For the Gene data, the full feature
ensemble is significantly more accurate than the single classifier, while for
the Satellite Image and Splice data sets, the ensemble is only marginally
more accurate.

In the case of the Gene data, the average ensembles with 20, 30, and 40
inputs are significantly more accurate than both the original network
ensembles described in the previous section and their PCA counterparts. With
IDE, the performance of the ensemble goes up as the number of features
increases until all the relevant features are included and then starts
declining with the addition of irrelevant features. The average correlation
behaves the same way. For 10 or fewer features, we expect the average
correlation to be low because different sets of 10 features have the highest
relevance to each class. As the number of features increases up to 30, the
base classifiers have an increasing number of common features. At 30, the
base classifiers have virtually all common features, namely the 30 that are
relevant to all three classes; therefore, we would expect maximum average
correlation. Beyond 30, each base classifier is getting (probably different)
irrelevant features, leading to a reduction in correlation. With PCA, the
performance of the ensemble is relatively stable and inferior to IDE. This is
consistent with the fact that principal components are not necessarily good
discriminative features, and adding principal components beyond the first few
would likely have little effect on the classification performance. The
performance of the ensemble with random feature subsets increases in random
increments with the addition of features depending on how relevant they are.
On this data set, the performance of random feature ensembles was
uncompetitive because random selection never yielded good feature subsets.

Figure 3: Performance (a) and Correlations (b) for the Gene data set.

Figure 5: Performance (a) and Correlations (b) for the Satellite data set
(panels: Classification Performance for Satellite Data; Classifier
Correlations for Satellite Data; horizontal axis: Number of Features).

In the Splice data experiments, all the decimated feature-based ensembles
significantly outperformed both the original ensemble and the PCA-based
ensembles. Random feature-based ensembles performed somewhat better here
than on the Gene data set. With 40 or more features, they were competitive
with input decimation. However, the best performing predictor overall is
clearly the input decimated ensemble with 10 inputs per classifier. What is
particularly notable in this case is that a reduction of dimensionality based
on PCA has a strong negative impact on the classification performance. With
20 principal components, for example, the performance of the single
classifiers drops by 7% relative to the single classifier with all the input
features, whereas the performance of the ID single classifier increases by
3%. The improvement in the performance of the single classifiers due to
decimation is an initially surprising aspect of these experiments, since one
may not expect to find too many “irrelevant” features in these real data
sets. However, an analysis shows that the inputs that were decimated were in
fact providing “noise” to the classifier. Although it is theoretically true
that the classifier with more information will do at least as well as the
classifier with less information, in practice, with only a limited amount of
data, extracting the correct information can be difficult for such
classifiers, causing them to perform worse than their counterparts with less
information.

On the Satellite Image data, however, the input decimated ensemble with
27 features was the only one that did not perform significantly worse than
the single classifier and the original ensemble. Both the PCA and random
feature ensembles outperformed IDE. Because the single IDE classifiers
performed much worse than the PCA and random feature single classifiers, we
examined the features that were chosen in each ensemble. Figure 6 shows the
average correlations among the features chosen for the base classifiers in
the three types of ensembles. The features that IDE chose have a much higher
correlation among themselves relative to random and PCA ensembles, especially
for smaller numbers of inputs. This means that IDE often chooses several
features with high correlations to the class without realizing that they may
be redundant. Random feature selection does not fall into this trap since it
does not consider correlations at all. PCA’s correlations are the lowest
because it creates features specifically designed to have low correlations
among each other. Among the three Proben1/UCI data sets, this is the one with
the lowest dimensionality, and shows two things: (i) in order to take
advantage of input decimation, the initial dimensionality has to be high, as
there are likely to be more irrelevant features that can be removed; and (ii)
if there are features that have significant meaning, they need to be included
in the feature set regardless of their correlation to the particular output.
We observed that consecutive groups of four features in the Satellite Image
data set correspond to spectral values for a given pixel. In examining the
eigenvalues and eigenvectors, we found that the highest eigenvalue was 91.6%
of the sum of the eigenvalues, and the corresponding eigenvector was a simple
linear combination of the four spectral values across all the pixels. In this
case, the higher principal components provide good discriminative features
(i.e., the data “looks” like that in Figure 2(a)). A potential improvement to
input decimation is to select “wild card” features based on correlation with
all the classes and include them in each decimated subset.

Figure 6: Average Correlation of Features in Satellite Image Data.
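The redundancy check described above, looking at how strongly the chosen features correlate with one another, can be sketched as follows. Averaging the absolute Pearson correlation over all pairs of selected features is an assumption made for this illustration; the text does not state the exact statistic plotted in Figure 6, and the function names are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def mean_feature_correlation(X, feature_indices):
    """Average |correlation| over all pairs of the selected features.

    Requires at least two selected features.
    """
    cols = {j: [row[j] for row in X] for j in feature_indices}
    pairs = list(combinations(feature_indices, 2))
    return sum(abs(pearson(cols[a], cols[b])) for a, b in pairs) / len(pairs)
```

A value near 1 for a decimated subset signals that the selected features are largely redundant, which is the situation reported for IDE on the Satellite Image data.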

4.3 Synthetic Data 

In this section we construct synthetic data sets to enable us to study the prop- 
erties of input decimated ensembles in a systematic manner. To that end we 
use the following two synthetic data sets: 

• Set A:

  - Three classes, one unimodal Gaussian per class.

  - 300 training patterns and 150 test patterns: 100 training and 50 test
    patterns per class.

  - 100 features per pattern where there are:

    * 10 relevant features per class. Patterns that belong to a class
      are generated from a multivariate normal distribution in 10
      independent dimensions distributed as N(40, 5²). There are no
      dimensions in common among the three classes. Therefore, there
      are 30 relevant features. For patterns in each class, the 20
      features that are relevant to the other two classes are distributed
      as U[-100, 100].¹⁰

    * 70 irrelevant features, distributed as U[-100, 100].

• Set B: Same as Set A, except that there is overlap among the relevant
  features for each class. That is, each class has three relevant features in
  common with every other class, but there are no features that are relevant
  to all three classes.

In data set A there is an abundance of features that are irrelevant for the 
classification task. This data set was chosen to represent large data mining 
problems where the algorithms may get swamped by irrelevant data. In data 
set B the overlap among features relevant to each class provides a more diffi- 
cult problem where the base classifiers are now forced to select some common 
features, reducing the potential for correlation reduction. 
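A generator for data in the style of Set A might look like the sketch below. The layout of the relevant features (class c owning features 10c to 10c + 9), the function name, and the seeding scheme are assumptions for illustration; only the distributions and counts are taken from the description above.

```python
import random

N_CLASSES = 3
N_FEATURES = 100
RELEVANT_PER_CLASS = 10


def make_set_a(n_per_class, seed=0):
    """Return (X, y) with n_per_class patterns for each of the three classes."""
    rng = random.Random(seed)
    X, y = [], []
    for c in range(N_CLASSES):
        # Assumption: class c's relevant features occupy indices [10c, 10c + 10).
        start = c * RELEVANT_PER_CLASS
        for _ in range(n_per_class):
            # Every feature starts as U[-100, 100]: this covers the 70 irrelevant
            # features and the 20 features relevant only to the other classes.
            row = [rng.uniform(-100.0, 100.0) for _ in range(N_FEATURES)]
            for j in range(start, start + RELEVANT_PER_CLASS):
                row[j] = rng.gauss(40.0, 5.0)  # N(40, 5^2): mean 40, std 5
            X.append(row)
            y.append(c)
    return X, y
```

Training and test sets of the stated sizes would then be drawn as `make_set_a(100, seed=0)` and `make_set_a(50, seed=1)`. Set B would additionally require overlapping the relevant index ranges, which is not shown here.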

4.3.1 Synthetic Set A 

Figure 7 presents the classification accuracies and base classifier
correlations on Synthetic data set A as a function of the number of inputs
(which are either the number of selected principal components or the number
of features selected for each base classifier through input decimation). The
original single classifier and original ensemble use all the input
features.¹¹ The points for the maximum number of features (e.g., 100 features
in this data set) always represent the performance of the original
classifier/ensemble.

An important observation that is apparent from these results is that neither
PCA ensembles nor PCA base classifiers are particularly sensitive to the
number of inputs. The correlations among the base classifiers reinforce this
conclusion. Fewer input features in PCA means the base classifiers are more
correlated since they all share the same principal features. Note, however,
that input decimated base classifiers have low correlation for small numbers
of features, increasing correlation up to 30 features, and decreasing
correlation after that. The base classifiers’ average performance follows a
similar pattern. Interestingly, input decimated ensembles are not adversely
affected by the poor performance of the base classifiers (e.g., input
decimated ensembles with 5 features outperformed input decimated ensembles
with 50 features while base classifiers with 5 features gave significantly
worse results than base classifiers with 50 features).

¹⁰ Clearly, because of this, all 30 features have some relevance to all three
classes; however, the 10 features used to generate patterns belonging to each
class are clearly substantially more relevant than the other 20 features.

¹¹ The base classifier used was an MLP with a single hidden layer consisting
of 95 units, trained using a learning rate of 0.2 and a momentum term of 0.5.

In cases where more than 30 features were used, the performance of the 
ensemble declined as further features were added, i.e., as more and more
irrelevant features were included. However, all the input decimation ensembles 
provided statistically significant improvements over the original ensembles, PCA 
ensembles, and random-feature ensembles. 

The single decimated classifiers with 20 or more features outperformed
the original single classifier. This perhaps surprising result (as one might have 
expected only the ensemble performance to improve when using subsets of the 
features) is mainly due to the simplification of the learning tasks, which allows 
the classifiers to learn the mapping more efficiently. 

Interestingly, the average correlation among classifiers does not decrease un- 
til a very small number of features remain. We attribute this to the removal 
of noise — removing noise increases the amount of information shared between 
the base classifiers. Indeed, the correlation increases steadily as features are 
removed until we reach 30 features (which corresponds to the actual number of 
relevant features). After that point, removing features reduces the correlation 
because the base classifiers’ feature sets have a decreasing number of common 
features. The base classifiers’ performances also decline; however, the ensemble 
performance still remains high. This experiment clearly shows a typical trade-off 
in ensemble learning: one can either increase individual classifier performance 
(as for input decimation with more than 30 features) or reduce the correlation 
among classifiers (as for input decimation with fewer than 20 features) to improve
ensemble performance. 
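
The trade-off described above can be made concrete with a small sketch. The correlation measure used in the experiments is defined outside this section, so the code assumes Pearson correlation between 0/1 error indicators and plurality voting as the combiner; both choices, along with the function names, are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0  # convention chosen here for constant sequences
    return cov / sqrt(va * vb)


def mean_error_correlation(predictions, labels):
    """Average pairwise correlation of the classifiers' error indicators.

    predictions holds one list of predicted labels per base classifier.
    """
    errors = [[int(p != t) for p, t in zip(preds, labels)]
              for preds in predictions]
    pairs = list(combinations(errors, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def plurality_vote_accuracy(predictions, labels):
    """Accuracy of the ensemble that outputs the most-voted label."""
    correct = 0
    for i, target in enumerate(labels):
        votes = Counter(preds[i] for preds in predictions)
        winner = votes.most_common(1)[0][0]  # ties go to the first label seen
        correct += int(winner == target)
    return correct / len(labels)
```

For example, three classifiers that each misclassify a different third of six patterns have a mean error correlation of about -0.5, and their plurality vote is correct on every pattern even though each classifier alone is correct on only four of six. Three identical classifiers with the same individual accuracy have correlation 1 and gain nothing from voting.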

4.3.2 Synthetic Set B 

Figure 8 presents the results for the second synthetic data set, which is similar 
to the first data set except that there is overlap among the relevant features for 
the classes. 12 Because of this overlap, this feature set has fewer total relevant 
features and thus it constitutes a more difficult problem (as indicated by com- 
paring the results on the full feature base classifiers and ensembles on this data 
set to the previous one). 

Note that the correlations in this data set remained fairly constant across 
the board for IDE and PCA-based ensembles. Input decimation did not reduce 
the correlations dramatically for small feature sets in dataset B the way it did 
in the case of dataset A. This is mainly caused by the “coupling” among the base
classifiers due to their common input features. 

In spite of these difficulties, input decimation ensembles perform extremely
well. Indeed, they significantly outperform the original ensemble, PCA ensembles,
and random-feature ensembles on all but a few subsets where they only
provide marginal improvements. Furthermore, the input-decimated single classifiers
also outperform their original and PCA counterparts for all but the 60 and
70 feature subsets. This is particularly heartening since this feature set is a more
representative abstraction of real data sets (data sets with “clean” separation
among classes are quite rare). This experiment demonstrates that when there
is overlap among classes, class information becomes particularly relevant. PCA
and random feature selection operate without this vital information; therefore,
they are unlikely to provide competitive performance.

12 The single classifier used was an MLP with a single hidden layer consisting of 95 units,
trained using a learning rate of 0.2 and a momentum term of 0.5.


5 Discussion 

This paper discusses input decimation, a dimensionality reduction-based ensem- 
ble method that provides good generalization performance by reducing the cor- 
relations among the classifiers in the ensemble. Through controlled experiments, 
we show that the input decimated single classifiers often outperform the single 
original classifiers (trained on the full feature set), demonstrating that simply
eliminating irrelevant features can improve performance. In addition, elimi- 
nating irrelevant features in each of many classifiers using different relevance 
criteria (in this case, relevance with respect to different classes) yields signif- 
icant improvement in ensemble performance through correlation reduction, as 
seen by comparing our decimated ensembles to the original ensembles. Selecting 
the features using class label information also provides significant performance 
gains over PCA-based ensembles and random feature subset selection. 
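
The class-wise selection step summarized in this paragraph can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: it assumes that relevance is the absolute Pearson correlation between a feature and a 0/1 class indicator, and that every class keeps the same number of features. The names `decimate_inputs` and `n_keep` are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0  # a constant feature carries no class information
    return cov / sqrt(va * vb)


def decimate_inputs(patterns, labels, n_keep):
    """Return {class: sorted feature indices}, one subset per class.

    Each subset holds the n_keep features most correlated (in absolute
    value, an assumption) with membership in that class; one base
    classifier would then be trained on each subset.
    """
    n_features = len(patterns[0])
    columns = [[row[j] for row in patterns] for j in range(n_features)]
    subsets = {}
    for c in sorted(set(labels)):
        indicator = [1.0 if y == c else 0.0 for y in labels]
        scores = [abs(pearson(col, indicator)) for col in columns]
        ranked = sorted(range(n_features), key=lambda j: scores[j],
                        reverse=True)
        subsets[c] = sorted(ranked[:n_keep])
    return subsets
```

Because each class ranks the features against its own indicator, the subsets generally differ from one base classifier to the next, which is the source of the correlation reduction discussed above.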

Through our tests on synthetic and real data sets, we examined the char- 
acteristics that data sets need to have to fully benefit from input decimation. 
We observed that input decimation yields the greatest improvements over the 
original ensemble when (i) there are a large number of features (i.e., where it is 
likely that there will be irrelevant features); and (ii) when the number of training 
examples is small relative to the input dimensionality (i.e., where it is difficult 
to properly learn all the parameters in a classifier based on the full feature set). 

In both cases, by removing the extraneous features, input decimation reduces 
noise and thereby reduces the number of training examples needed to produce a 
meaningful model (i.e., alleviating the curse of dimensionality). Our synthetic 
data sets were generated using multivariate distributions where the feature val- 
ues were generated independently. We plan to generate synthetic data sets with 
dependencies among the features to see how they affect our method. Our
experiments with real datasets, especially the Satellite Image dataset, showed
that input decimation may benefit by keeping out redundant features and in- 
cluding those features that have a high correlation with all classes on average 
even though they do not have high correlation with any one class. We plan to 
investigate various possible methods of doing this. 

Note that input decimation shares the central aim of generating a diverse 
pool of classifiers for the ensemble with many methods such as bagging. How- 
ever, by focusing on the input features rather than the input patterns, input 
decimation focuses on a different “axis” of correlation reduction than does
bagging. Consequently, input decimation is orthogonal to bagging, and one can use
input decimation in conjunction with bagging. We plan to experiment with this.

A final observation is that input decimation works well in spite of our
crude method of feature selection (i.e., using statistical correlation of each
feature individually with each class). One reason for this
is that we have greatly simplified the relevance criterion. Unlike feature
selection methods that consider the discriminatory ability across all classes, we
only consider the relevance of the features to a single class. This typically causes
each classifier in the ensemble to get a different subset of features, leading to
the superior performance we have demonstrated. Nevertheless, we are currently
extending this work in four directions: considering cross-correlations among the
features; investigating mutual information-based relevance criteria; incorporating
global relevance into the selection process; and selecting a different number
of features for each classifier.
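
The earlier remark that bagging and input decimation act on different axes can be illustrated with a short sketch: bagging resamples the patterns (rows), while input decimation restricts the features (columns), so the two can be applied to the same training set in sequence. The function below is a hypothetical helper, not part of the method as evaluated in this paper, and it takes the feature subset as given.

```python
import random


def bootstrap_decimated_sample(patterns, labels, feature_subset, rng):
    """Draw a bootstrap sample of the rows, keeping only selected columns.

    rng is a random.Random instance; feature_subset is a list of feature
    indices chosen beforehand for one base classifier.
    """
    n = len(patterns)
    idx = [rng.randrange(n) for _ in range(n)]  # rows, with replacement
    sample = [[patterns[i][j] for j in feature_subset] for i in idx]
    sample_labels = [labels[i] for i in idx]
    return sample, sample_labels
```

Each base classifier would receive its own bootstrap sample and its own feature subset, so diversity would come from both sources at once.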